# Introduction: Which defensive metrics best predict a team's number of wins over a season, and are those metrics equally successful at predicting postseason success? It is a common belief that defense wins championships; this paper investigates which defensive metrics matter most to team performance. As such, only defensive statistics are included in the analysis. The metrics studied are ADJDE (Adjusted Defensive Efficiency), EFG_D (Effective Field Goal Percentage Allowed), TORD (Turnover Percentage Committed (Steal Rate)), DRB (Offensive Rebound Rate Allowed), FTRD (Free Throw Rate Allowed), 2P_D (Two-Point Shooting Percentage Allowed), and 3P_D (Three-Point Shooting Percentage Allowed). The dataset was pulled from Kaggle, where it was compiled by Andrew Sundberg, and contains D1 basketball information from the 2013-2021 seasons.
library(readr)
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(leaps)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(tidyr)
cbb <- read_csv("cbb.csv")
## Rows: 2455 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): TEAM, CONF, POSTSEASON
## dbl (21): G, W, ADJOE, ADJDE, BARTHAG, EFG_O, EFG_D, TOR, TORD, ORB, DRB, FT...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cbb
# The columns 2P_D and 3P_D are renamed because R names that begin with a digit are not syntactic and would need backticks in a model formula.
colnames(cbb)[17] <- "Two_PD"
colnames(cbb)[19] <- "Three_PD"
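# Renaming by position depends on the column order of the CSV. A minimal sketch of renaming by name instead, using a small made-up data frame (the values are hypothetical; only the column names 2P_D and 3P_D come from the dataset):

```r
# Toy data frame with hypothetical values; check.names = FALSE keeps the
# non-syntactic names instead of converting them.
toy <- data.frame(W = c(20, 25), `2P_D` = c(48.1, 45.3), `3P_D` = c(33.0, 31.2),
                  check.names = FALSE)

# Rename by matching the old name rather than a column index.
names(toy)[names(toy) == "2P_D"] <- "Two_PD"
names(toy)[names(toy) == "3P_D"] <- "Three_PD"

names(toy)
```

# Matching on the name keeps the rename correct even if columns are added or reordered upstream.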
cbb_reordered <- cbb %>% relocate(W)
cbb_reordered
# Creation of model predicting wins using defensive metrics
model = lm(W ~ ADJDE + EFG_D + TORD + DRB + FTRD + Two_PD + Three_PD, data = cbb_reordered)
summary(model)
##
## Call:
## lm(formula = W ~ ADJDE + EFG_D + TORD + DRB + FTRD + Two_PD +
## Three_PD, data = cbb_reordered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8018 -3.0304 -0.1155 2.9466 14.2151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.66663 1.88664 49.117 < 2e-16 ***
## ADJDE -0.25005 0.02985 -8.376 < 2e-16 ***
## EFG_D -1.25863 0.44348 -2.838 0.00458 **
## TORD 0.64756 0.05637 11.488 < 2e-16 ***
## DRB -0.52622 0.03523 -14.936 < 2e-16 ***
## FTRD -0.26961 0.01634 -16.503 < 2e-16 ***
## Two_PD 0.42065 0.28968 1.452 0.14659
## Three_PD 0.15268 0.23626 0.646 0.51819
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.325 on 2447 degrees of freedom
## Multiple R-squared: 0.5733, Adjusted R-squared: 0.5721
## F-statistic: 469.7 on 7 and 2447 DF, p-value: < 2.2e-16
# As seen above, every predictor in the model is statistically significant at the 0.01 level except two-point and three-point shooting percentage allowed. To help determine which of the significant predictors matter most, a backward selection process will be used.
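# One possible reason the two-point and three-point percentages are not significant is that effective field goal percentage is built from the same made shots, so the three predictors overlap heavily. A minimal sketch of checking for this with variance inflation factors, computed by hand on simulated data (all values and the mixing weights are assumptions for illustration, not taken from cbb.csv):

```r
set.seed(1)
n <- 200
two_pd   <- rnorm(n, mean = 49, sd = 3)   # simulated two-point % allowed
three_pd <- rnorm(n, mean = 34, sd = 2)   # simulated three-point % allowed
# Simulated eFG% allowed: a weighted mix of the two plus a little noise.
efg_d <- 0.65 * two_pd + 0.35 * 1.5 * three_pd + rnorm(n, sd = 0.3)
sim <- data.frame(efg_d, two_pd, three_pd)

# VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all of the other predictors.
vif_by_hand <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 1)
```

# Values well above 5 to 10 are commonly read as a sign of strong collinearity, which inflates standard errors and can make individually useful predictors look insignificant.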
plot(model)




# Looking at the residual plots, the residuals are mostly centered around zero, which suggests that a linear model is reasonable for this combination of defensive predictors. The model explains roughly 57% of the variation in wins (multiple R-squared of 0.5733). The goal is to have as few predictors as possible without compromising the fit, so further analysis will determine whether the model can be simplified.
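# Backward Selection
# W ~ . offers every other column as a candidate predictor, not only the defensive metrics.
full_model = lm(W~., data=cbb_reordered)
# Supplying the full model's MSE as `scale` makes step() compare models by Mallows' Cp.
MSE = (summary(full_model)$sigma)^2
# With no scope given, step() defaults to backward elimination.
model2 = step(full_model, scale=MSE, trace=FALSE)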
summary(model2)
##
## Call:
## lm(formula = W ~ CONF + G + ADJOE + ADJDE + EFG_O + EFG_D + TOR +
## TORD + ORB + DRB + FTRD + WAB + POSTSEASON + YEAR, data = cbb_reordered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3526 -0.6048 -0.0322 0.5911 3.6957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -143.40144 57.47913 -2.495 0.012980 *
## CONFACC -2.20256 0.27177 -8.105 5.69e-15 ***
## CONFAE 1.94386 0.45392 4.282 2.29e-05 ***
## CONFAmer -0.84558 0.30750 -2.750 0.006217 **
## CONFASun 1.73694 0.48127 3.609 0.000344 ***
## CONFB10 -2.25446 0.26628 -8.467 4.15e-16 ***
## CONFB12 -2.52711 0.27350 -9.240 < 2e-16 ***
## CONFBE -1.82688 0.26338 -6.936 1.51e-11 ***
## CONFBSky 2.27003 0.47121 4.817 2.03e-06 ***
## CONFBSth 2.05773 0.44656 4.608 5.38e-06 ***
## CONFBW 1.55779 0.43834 3.554 0.000422 ***
## CONFCAA 2.33344 0.43390 5.378 1.25e-07 ***
## CONFCUSA 1.52720 0.42547 3.589 0.000370 ***
## CONFHorz 1.81028 0.43842 4.129 4.39e-05 ***
## CONFIvy 1.03201 0.45786 2.254 0.024706 *
## CONFMAAC 1.39503 0.44090 3.164 0.001668 **
## CONFMAC 1.43831 0.42490 3.385 0.000778 ***
## CONFMEAC 1.58814 0.52362 3.033 0.002570 **
## CONFMVC 0.95897 0.37582 2.552 0.011069 *
## CONFMWC 0.27149 0.31534 0.861 0.389762
## CONFNEC 1.29242 0.50308 2.569 0.010540 *
## CONFOVC 1.68953 0.43311 3.901 0.000111 ***
## CONFP12 -0.99352 0.26766 -3.712 0.000233 ***
## CONFPat 1.46300 0.45043 3.248 0.001254 **
## CONFSB 1.69726 0.41443 4.095 5.05e-05 ***
## CONFSC 1.89805 0.44062 4.308 2.05e-05 ***
## CONFSEC -1.57011 0.28554 -5.499 6.62e-08 ***
## CONFSlnd 2.77315 0.47940 5.785 1.41e-08 ***
## CONFSum 2.51149 0.44658 5.624 3.39e-08 ***
## CONFSWAC -0.09542 0.51114 -0.187 0.851997
## CONFWAC 1.91404 0.44911 4.262 2.50e-05 ***
## CONFWCC 0.29911 0.36191 0.826 0.409002
## G 0.67163 0.03871 17.352 < 2e-16 ***
## ADJOE -0.18548 0.02960 -6.265 9.13e-10 ***
## ADJDE 0.40456 0.03643 11.106 < 2e-16 ***
## EFG_O 0.31304 0.04387 7.135 4.20e-12 ***
## EFG_D -0.58307 0.04886 -11.934 < 2e-16 ***
## TOR -0.20526 0.04332 -4.739 2.94e-06 ***
## TORD 0.52834 0.04351 12.143 < 2e-16 ***
## ORB 0.11268 0.02112 5.335 1.56e-07 ***
## DRB -0.25210 0.02867 -8.793 < 2e-16 ***
## FTRD -0.05609 0.01059 -5.299 1.88e-07 ***
## WAB 0.89136 0.02581 34.530 < 2e-16 ***
## POSTSEASONChampions 0.42651 0.52655 0.810 0.418391
## POSTSEASONE8 -1.69704 0.41963 -4.044 6.24e-05 ***
## POSTSEASONF4 -0.93141 0.45443 -2.050 0.041015 *
## POSTSEASONR32 -2.14192 0.42263 -5.068 6.01e-07 ***
## POSTSEASONR64 -2.41064 0.44449 -5.423 9.83e-08 ***
## POSTSEASONR68 -2.46859 0.50580 -4.881 1.50e-06 ***
## POSTSEASONS16 -1.61716 0.41120 -3.933 9.80e-05 ***
## YEAR 0.06873 0.02858 2.404 0.016626 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9529 on 425 degrees of freedom
## (1979 observations deleted due to missingness)
## Multiple R-squared: 0.9546, Adjusted R-squared: 0.9493
## F-statistic: 178.9 on 50 and 425 DF, p-value: < 2.2e-16
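# Because the formula W ~ . offers every column to the selection, the retained model mixes offensive, schedule, and postseason variables with the defensive ones. A minimal sketch of backward selection limited to defensive predictors, run on simulated data (the column names mirror the dataset, but all values and coefficients are made up):

```r
set.seed(42)
n <- 300
sim <- data.frame(
  ADJDE    = rnorm(n, 100, 6),
  FTRD     = rnorm(n, 33, 5),
  DRB      = rnorm(n, 29, 3),
  Three_PD = rnorm(n, 34, 2)   # unrelated to W in this simulation
)
# Simulated wins depend on three predictors only; coefficients are arbitrary.
sim$W <- 90 - 0.5 * sim$ADJDE - 0.3 * sim$FTRD - 0.4 * sim$DRB + rnorm(n, sd = 3)

defensive_full <- lm(W ~ ADJDE + FTRD + DRB + Three_PD, data = sim)
# Backward elimination: drop one term at a time for as long as AIC improves.
defensive_step <- step(defensive_full, direction = "backward", trace = 0)
formula(defensive_step)
```

# Restricting the starting formula this way also avoids losing rows to missing values in columns, such as POSTSEASON, that are not part of the question.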
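# The defensive metrics retained by the backward selection process are ADJDE, EFG_D, TORD, DRB, and FTRD. Note that this selection considered every column in the dataset, and the model was fit on only 476 of the 2455 rows because 1979 observations were deleted due to missingness. Using regsubsets, we will choose the best three predictors among these five and see what the residuals look like.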
# regsubsets performs an exhaustive search over subsets, so it is given only the five shortlisted predictors
all = regsubsets(W ~ ADJDE + EFG_D + TORD + DRB + FTRD, data=cbb_reordered)
summary(all)
## Subset selection object
## Call: regsubsets.formula(W ~ ADJDE + EFG_D + TORD + DRB + FTRD, data = cbb_reordered)
## 5 Variables (and intercept)
## Forced in Forced out
## ADJDE FALSE FALSE
## EFG_D FALSE FALSE
## TORD FALSE FALSE
## DRB FALSE FALSE
## FTRD FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## ADJDE EFG_D TORD DRB FTRD
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" "*"
## 4 ( 1 ) " " "*" "*" "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
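# regsubsets reports the best model of each size. Adjusted Defensive Efficiency is the best single predictor of wins among the defensive metrics; the best two-predictor model adds Free Throw Rate Allowed, and the best three-predictor model adds Offensive Rebound Rate Allowed. Notably, the best four-predictor model drops ADJDE in favor of EFG_D and TORD, so the subsets are not strictly nested. Taking ADJDE, FTRD, and DRB as the three most important defensive metrics for wins, let's examine the residuals.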
topthreemodel = lm(W ~ ADJDE + FTRD + DRB, data = cbb_reordered)
plot(topthreemodel)




# Yet again, the residuals are mostly centered around zero, albeit with a slight curve at either end. Additionally, the points on the Q-Q plot fall close to a straight line, suggesting that the residuals are approximately normal with little skew.
hist(topthreemodel$residuals)

# The histogram of the residuals shows they are still approximately normally distributed. This suggests that the three-predictor model satisfies the same assumptions as the original seven-predictor model, although a comparison of fit (for example, adjusted R-squared) would be needed to confirm that it is a solid replacement. Now we turn to whether these variables are equally successful in predicting postseason success.
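# Residual plots check model assumptions but do not show how much explanatory power is lost by dropping predictors. A minimal sketch of comparing a larger and a smaller nested model by adjusted R-squared and a partial F-test, on simulated data (values and coefficients are assumptions for illustration):

```r
set.seed(7)
n <- 300
sim <- data.frame(
  ADJDE = rnorm(n, 100, 6), FTRD = rnorm(n, 33, 5),
  DRB   = rnorm(n, 29, 3),  TORD = rnorm(n, 18, 2)
)
# Simulated wins; TORD has a real effect here, so dropping it should cost fit.
sim$W <- 80 - 0.5 * sim$ADJDE - 0.3 * sim$FTRD - 0.4 * sim$DRB +
  0.6 * sim$TORD + rnorm(n, sd = 3)

larger  <- lm(W ~ ADJDE + FTRD + DRB + TORD, data = sim)
smaller <- lm(W ~ ADJDE + FTRD + DRB, data = sim)

# Adjusted R-squared penalises extra predictors, so the two are comparable.
c(larger = summary(larger)$adj.r.squared, smaller = summary(smaller)$adj.r.squared)

# Partial F-test: does the larger model fit significantly better?
comparison <- anova(smaller, larger)
comparison
```

# If the adjusted R-squared values are close and the F-test is not significant, the smaller model can be preferred on grounds of simplicity.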
cbb_reordered
# The average adjusted defensive efficiency, free throw rate allowed, and offensive rebounding rate allowed of all teams qualifying for the postseason will be compared with those of teams that reached the Elite Eight or better to see if there are still meaningful differences in each number.
# Filter out teams that didn't make the tournament
cbb_postseason <- cbb_reordered %>% drop_na(POSTSEASON)
# Note: national runners-up (coded "2ND" in this dataset) are not matched by this filter.
cbb_E8_or_better = subset(cbb_reordered, POSTSEASON %in% c("Champions", "F4", "E8"))
# Find average defensive efficiency of Elite Eight or better teams and compare to histogram of all postseason teams
postseason_average_DE = mean(cbb_postseason$ADJDE)
E8_or_better_average_DE = mean(cbb_E8_or_better$ADJDE)
E8_or_better_average_DE
## [1] 92.22245
hist(cbb_postseason$ADJDE)
abline(v=E8_or_better_average_DE, col='red', lwd=2)
abline(v=postseason_average_DE, col='blue', lwd=2)

# As shown above, the teams that reached the Elite Eight or better (red line) had a lower average adjusted defensive efficiency than postseason teams as a whole (blue line). Because ADJDE estimates points allowed per 100 possessions, a lower value indicates a stronger defense, so teams with stronger defenses tended to advance farther in the postseason. This supports defensive efficiency as a predictor of success beyond the regular season. A significance test would still be needed to determine whether the difference could be due to chance, but the result is a positive indication of the reliability of defensive efficiency for predicting postseason results.
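# A minimal sketch of the kind of significance test mentioned above: a Welch two-sample t-test comparing teams that reached the Elite Eight or better with tournament teams eliminated earlier, so that the two groups do not overlap. The data are simulated (group sizes, means, and spreads are made up), and deep_run is a hypothetical indicator column:

```r
set.seed(3)
sim <- data.frame(
  deep_run = rep(c(TRUE, FALSE), times = c(60, 400)),
  ADJDE    = c(rnorm(60, mean = 92, sd = 3), rnorm(400, mean = 96, sd = 4))
)

# Welch's t-test (the t.test default) does not assume equal variances.
result <- t.test(ADJDE ~ deep_run, data = sim)
result$estimate   # mean ADJDE in each group
result$p.value    # small values indicate a difference unlikely to be chance
```

# Comparing non-overlapping groups matters here because the Elite Eight teams are themselves part of the postseason field, and the test assumes independent samples.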
postseason_average_FTRD = mean(cbb_postseason$FTRD)
E8_or_better_average_FTRD = mean(cbb_E8_or_better$FTRD)
E8_or_better_average_FTRD
## [1] 31.49592
hist(cbb_postseason$FTRD)
abline(v=postseason_average_FTRD, col='blue', lwd=2)
abline(v=E8_or_better_average_FTRD, col='red', lwd=2)

# Yet again, the mean free throw rate allowed is lower among teams that progressed farther in the postseason. This suggests that free throw rate allowed is a solid predictor in both the regular season and the postseason. A significance test would still be required to confirm that the difference is not due to chance.
postseason_average_DRB = mean(cbb_postseason$DRB)
E8_or_better_average_DRB = mean(cbb_E8_or_better$DRB)
E8_or_better_average_DRB
## [1] 29.15102
hist(cbb_postseason$DRB)
abline(v=postseason_average_DRB, col='blue', lwd=2)
abline(v=E8_or_better_average_DRB, col='red', lwd=2)

# The offensive rebound rate allowed is slightly higher among teams that progressed farther in the postseason. This is a surprising result, since one would expect teams that are successful in March to allow few offensive rebounds. It suggests that offensive rebounding rate allowed is better suited to measuring regular season performance than postseason results.
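# The three comparisons above repeat the same calculation for each metric. A minimal sketch of computing all of the group means at once with aggregate(), on a made-up data frame (values are hypothetical; deep_run is an assumed indicator for Elite Eight or better):

```r
toy <- data.frame(
  deep_run = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  ADJDE    = c(91, 93, 95, 97, 96),
  FTRD     = c(30, 32, 33, 35, 34),
  DRB      = c(29, 30, 28, 29, 27)
)

# Mean of each metric within each group, one row per group.
group_means <- aggregate(cbind(ADJDE, FTRD, DRB) ~ deep_run, data = toy, FUN = mean)
group_means
```

# Putting the means side by side makes it easier to see which metrics move in the expected direction and which, like DRB in the analysis above, do not.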
# Conclusion: After a detailed analysis of the defensive metrics in D1 college basketball, three statistically significant metrics stand out for predicting wins in the regular season: adjusted defensive efficiency, free throw rate allowed, and offensive rebounding rate allowed. The residuals of the three-predictor model are centered around zero and approximately normal, which supports the use of a linear model. The predictive value of adjusted defensive efficiency and free throw rate allowed carried into the postseason, while offensive rebound rate allowed did not seem to have a measurable impact on postseason success. Overall, a team's adjusted defensive efficiency and free throw rate allowed are, at the very least, successful predictors of wins in the regular season and of progress in the postseason. The success of adjusted defensive efficiency in particular bodes well for the continued use of KenPom ratings to gauge the quality of a team.